AITopics | roman urdu

Collaborating Authors

roman urdu

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Evaluating Large Language Models on Urdu Idiom Translation

Khan, Muhammad Farmal, Akter, Mousumi

arXiv.org Artificial IntelligenceOct-21-2025

Idiomatic translation remains a significant challenge in machine translation, especially for low resource languages such as Urdu, and has received limited prior attention. To advance research in this area, we introduce the first evaluation datasets for Urdu to English idiomatic translation, covering both Native Urdu and Roman Urdu scripts and annotated with gold-standard English equivalents. We evaluate multiple open-source Large Language Models (LLMs) and Neural Machine Translation (NMT) systems on this task, focusing on their ability to preserve idiomatic and cultural meaning. Automatic metrics including BLEU, BERTScore, COMET, and XCOMET are used to assess translation quality. Our findings indicate that prompt engineering enhances idiomatic translation compared to direct translation, though performance differences among prompt types are relatively minor. Moreover, cross script comparisons reveal that text representation substantially affects translation quality, with Native Urdu inputs producing more accurate idiomatic translations than Roman Urdu.

large language model, machine learning, translation, (16 more...)

arXiv.org Artificial Intelligence

2510.1746

Country: North America > United States > New Mexico (0.15)

Genre: Research Report (0.70)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.96)

Add feedback

Fine-Tuning Large Language Models with QLoRA for Offensive Language Detection in Roman Urdu-English Code-Mixed Text

Hussain, Nisar, Qasim, Amna, Mehak, Gull, Zain, Muhammad, Hafeez, Momina, Sidorov, Grigori

arXiv.org Artificial IntelligenceOct-13-2025

The use of derogatory terms in languages that employ code mixing, such as Roman Urdu, presents challenges for Natural Language Processing systems due to unstated grammar, inconsistent spelling, and a scarcity of labeled data. In this work, we propose a QLoRA based fine tuning framework to improve offensive language detection in Roman Urdu-English text. We translated the Roman Urdu-English code mixed dataset into English using Google Translate to leverage English LLMs, while acknowledging that this translation reduces direct engagement with code mixing features. Our focus is on classification performance using English translated low resource inputs. We fine tuned several transformers and large language models, including Meta LLaMA 3 8B, Mistral 7B v0.1, LLaMA 2 7B, ModernBERT, and RoBERTa, with QLoRA for memory efficient adaptation. Models were trained and evaluated on a manually annotated Roman Urdu dataset for offensive vs non offensive content. Of all tested models, the highest F1 score of 91.45 was attained by Meta LLaMA 3 8B, followed by Mistral 7B at 89.66, surpassing traditional transformer baselines. These results demonstrate the efficacy of QLoRA in fine tuning high performing models for low resource environments such as code mixed offensive language detection, and confirm the potential of LLMs for this task. This work advances a scalable approach to Roman Urdu moderation and paves the way for future multilingual offensive detection systems based on LLMs.

large language model, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2510.03683

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Hope Speech Detection in code-mixed Roman Urdu tweets: A Positive Turn in Natural Language Processing

Ahmad, Muhammad, Waqas, Muhammad, Hamza, Ameer, Batyrshin, Ildar, Sidorov, Grigori

arXiv.org Artificial IntelligenceJun-30-2025

Hope is a positive emotional state involving the expectation of favorable future outcomes, while hope speech refers to communication that promotes optimism, resilience, and support, particularly in adverse contexts. Although hope speech detection has gained attention in Natural Language Processing (NLP), existing research mainly focuses on high-resource languages and standardized scripts, often overlooking informal and underrepresented forms such as Roman Urdu. To the best of our knowledge, this is the first study to address hope speech detection in code-mixed Roman Urdu by introducing a carefully annotated dataset, thereby filling a critical gap in inclusive NLP research for low-resource, informal language varieties. This study makes four key contributions: (1) it introduces the first multi-class annotated dataset for Roman Urdu hope speech, comprising Generalized Hope, Realistic Hope, Unrealistic Hope, and Not Hope categories; (2) it explores the psychological foundations of hope and analyzes its linguistic patterns in code-mixed Roman Urdu to inform dataset development; (3) it proposes a custom attention-based transformer model optimized for the syntactic and semantic variability of Roman Urdu, evaluated using 5-fold cross-validation; and (4) it verifies the statistical significance of performance gains using a t-test. The proposed model, XLM-R, achieves the best performance with a cross-validation score of 0.78, outperforming the baseline SVM (0.75) and BiLSTM (0.76), with gains of 4% and 2.63% respectively.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2506.21583

Country:

Asia > Pakistan (0.04)
South America (0.04)
North America > Mexico > Mexico City > Mexico City (0.04)
(2 more...)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.69)

Industry: Health & Medicine > Therapeutic Area (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis (0.88)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)

Add feedback

ERUPD -- English to Roman Urdu Parallel Dataset

Furqan, Mohammed, Khaja, Raahid Bin, Habeeb, Rayyan

arXiv.org Artificial IntelligenceDec-23-2024

Bridging linguistic gaps fosters global growth and cultural exchange. This study addresses the challenges of Roman Urdu -- a Latin-script adaptation of Urdu widely used in digital communication -- by creating a novel parallel dataset comprising 75,146 sentence pairs. Roman Urdu's lack of standardization, phonetic variability, and code-switching with English complicates language processing. We tackled this by employing a hybrid approach that combines synthetic data generated via advanced prompt engineering with real-world conversational data from personal messaging groups. We further refined the dataset through a human evaluation phase, addressing linguistic inconsistencies and ensuring accuracy in code-switching, phonetic representations, and synonym variability. The resulting dataset captures Roman Urdu's diverse linguistic features and serves as a critical resource for machine translation, sentiment analysis, and multilingual education.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2412.17562

Country:

Asia > India > Telangana > Hyderabad (0.05)
Europe > Czechia > Prague (0.04)
Asia > Middle East > Oman (0.04)
(4 more...)

Genre:

Research Report (0.65)
Overview (0.46)

Industry: Education (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback

Decoding Abusive language detection from social media comments using deep learning approaches

#artificialintelligenceApr-3-2022, 07:35:04 GMT

Why do we need this: Abusive language is an expression that contains abusive or dirty words in conversation (oral or text). With the increase in the culture of social media and social networking sites, the use of abusive language has increased rapidly. Every day, millions of comments are posted on the uploaded posts. Abusive language in online comments initiates cyber-bullying that targets individuals (celebrity, politician, product, etc.) and a group of people (specific country, age, religion, etc.). How to do this: Manually it is almost impossible to detect and filter out abusive comments from massive online comments.

abusive language, dl model, ml model, (14 more...)

#artificialintelligence

Country: Asia > Pakistan (0.05)

Genre: Research Report (0.34)

Industry: Information Technology (0.35)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Hate Speech Detection in Roman Urdu

Khan, Moin, Shahzad, Khurram, Malik, Kamran

arXiv.org Artificial IntelligenceAug-5-2021

Hate speech is a specific type of controversial content that is widely legislated as a crime that must be identified and blocked. However, due to the sheer volume and velocity of the Twitter data stream, hate speech detection cannot be performed manually. To address this issue, several studies have been conducted for hate speech detection in European languages, whereas little attention has been paid to low-resource South Asian languages, making the social media vulnerable for millions of users. In particular, to the best of our knowledge, no study has been conducted for hate speech detection in Roman Urdu text, which is widely used in the sub-continent. In this study, we have scrapped more than 90,000 tweets and manually parsed them to identify 5,000 Roman Urdu tweets. Subsequently, we have employed an iterative approach to develop guidelines and used them for generating the Hate Speech Roman Urdu 2020 corpus. The tweets in the this corpus are classified at three levels: Neutral-Hostile, Simple-Complex, and Offensive-Hate speech. As another contribution, we have used five supervised learning techniques, including a deep learning technique, to evaluate and compare their effectiveness for hate speech detection. The results show that Logistic Regression outperformed all other techniques, including deep learning techniques for the two levels of classification, by achieved an F1 score of 0.906 for distinguishing between Neutral-Hostile tweets, and 0.756 for distinguishing between Offensive-Hate speech tweets.

accessed, guideline, tweet, (15 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3414524

2108.0283

Country:

North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.14)
Oceania > New Zealand (0.04)
Europe > France (0.04)
(24 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Information Technology > Services (1.00)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.92)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.68)

Add feedback

An Attention Based Neural Network for Code Switching Detection: English & Roman Urdu

Hussain, Aizaz, Arshad, Muhammad Umair

arXiv.org Artificial IntelligenceMar-3-2021

Code-switching is a common phenomenon among people with diverse lingual background and is widely used on the internet for communication purposes. In this paper, we present a Recurrent Neural Network combined with the Attention Model for Language Identification in Code-Switched Data in English and low resource Roman Urdu. The attention model enables the architecture to learn the important features of the languages hence classifying the code switched data. We demonstrated our approach by comparing the results with state of the art models i.e. Hidden Markov Models, Conditional Random Field and Bidirectional LSTM. The models evaluation, using confusion matrix metrics, showed that the attention mechanism provides improved the precision and accuracy as compared to the other models.

dataset, mirza omer beg, roman urdu, (11 more...)

arXiv.org Artificial Intelligence

2103.02252

Country: Asia > Pakistan > Islamabad Capital Territory > Islamabad (0.04)

Genre: Research Report (0.69)

Industry: Information Technology > Security & Privacy (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.93)

Add feedback

NUBOT: Embedded Knowledge Graph With RASA Framework for Generating Semantic Intents Responses in Roman Urdu

Shabbir, Johar, Arshad, Muhammad Umair, Shahzad, Waseem

arXiv.org Artificial IntelligenceFeb-20-2021

The understanding of the human language is quantified by identifying intents and entities. Even though classification methods that rely on labeled information are often used for the comprehension of language understanding, it is incredibly time consuming and tedious process to generate high propensity supervised datasets. In this paper, we present the generation of accurate intents for the corresponding Roman Urdu unstructured data and integrate this corpus in RASA NLU module for intent classification. We embed knowledge graph with RASA Framework to maintain the dialog history for semantic based natural language mechanism for chatbot communication. We compare results of our work with existing linguistic systems combined with semantic technologies. Minimum accuracy of intents generation is 64 percent of confidence and in the response generation part minimum accuracy is 82.1 percent and maximum accuracy gain is 96.7 percent. All the scores refers to log precision, recall, and f1 measure for each intents once summarized for all. Furthermore, it creates a confusion matrix represents that which intents are ambiguously recognized by approach.

chatbot, knowledge graph, rasa framework, (13 more...)

arXiv.org Artificial Intelligence

2102.1041

Country:

Asia > Pakistan > Islamabad Capital Territory > Islamabad (0.05)
North America > United States > Texas > Travis County > Austin (0.04)

Genre: Research Report (0.64)

Industry: Information Technology (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)
Information Technology > Artificial Intelligence > Representation & Reasoning > Semantic Networks (0.73)

Add feedback

Sentiment Analysis for YouTube Comments in Roman Urdu

Tehreem, Tooba

arXiv.org Artificial IntelligenceFeb-19-2021

Sentiment analysis is a vast area in the Machine learning domain. A lot of work is done on datasets and their analysis of the English Language. In Pakistan, a huge amount of data is in roman Urdu language, it is scattered all over the social sites including Twitter, YouTube, Facebook and similar applications. In this study the focus domain of dataset gathering is YouTube comments. The Dataset contains the comments of people over different Pakistani dramas and TV shows. The Dataset contains multi-class classification that is grouped The comments into positive, negative and neutral sentiment. In this Study comparative analysis is done for five supervised learning Algorithms including linear regression, SVM, KNN, Multi layer Perceptron and Na\"ive Bayes classifier. Accuracy, recall, precision and F-measure are used for measuring performance. Results show that accuracy of SVM is 64 percent, which is better than the rest of the list.

dataset, mirza omer beg, sentiment analysis, (11 more...)

arXiv.org Artificial Intelligence

2102.10075

Country:

Asia > Pakistan > Islamabad Capital Territory > Islamabad (0.05)
Europe > Middle East > Cyprus (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report > New Finding (1.00)

Industry:

Media (1.00)
Leisure & Entertainment (1.00)
Information Technology > Security & Privacy (0.68)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
(3 more...)

Add feedback

A Clustering Framework for Lexical Normalization of Roman Urdu

Khan, Abdul Rafae, Karim, Asim, Sajjad, Hassan, Kamiran, Faisal, Xu, Jia

arXiv.org Artificial IntelligenceMar-31-2020

Roman Urdu is an informal form of the Urdu language written in Roman script, which is widely used in South Asia for online textual content. It lacks standard spelling and hence poses several normalization challenges during automatic language processing. In this article, we present a feature-based clustering framework for the lexical normalization of Roman Urdu corpora, which includes a phonetic algorithm UrduPhone, a string matching component, a feature-based similarity function, and a clustering algorithm Lex-Var. UrduPhone encodes Roman Urdu strings to their pronunciation-based representations. The string matching component handles character-level variations that occur when writing Urdu using Roman script.

dataset, normalization, variation, (12 more...)

arXiv.org Artificial Intelligence

doi: 10.1017/S1351324920000285

2004.00088

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > California > Los Angeles County > Los Angeles (0.14)
Africa > Middle East > Egypt > Giza Governorate > Giza (0.05)
(32 more...)

Genre: Research Report (1.00)

Industry: Information Technology (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback